For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
The goal of the analysis is to determine which variables are best at predicting whether a customer will “churn”. Churn means the customer stopped using the company’s product or service. The dataset is for a telecommunications company called Telco so in the context of the data churn means that the customer terminated their subscription. With this analysis, Telco will have a better idea of what factors contribute to their clients leaving and therefore develop a plan to retain them.
This dataset has 7,043 rows and 21 variables. For this analysis, we will ignore the customerID variable, which serves as a unique identifier for each customer and does not provide any meaningful information for predicting customer churn, leaving 20 variables.
Author(s): Steven Macko Title: Telco Customer Churn Year: 2018 Version: 1 Publisher: IBM/Kaggle URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
VARIABLES TO PREDICT WITH
VARIABLES WE WANT TO PREDICT
Organizing data can also include summarizing data values in simple one-way and two-way tables.
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Notes |
|---|---|---|---|---|---|---|---|
| customerID | | | | | | | character, length 7043 |
| gender | | | | | | | character, length 7043 |
| SeniorCitizen | 0.0000 | 0.0000 | 0.0000 | 0.1621 | 0.0000 | 1.0000 | |
| Partner | 0.000 | 0.000 | 1.000 | 0.517 | 1.000 | 1.000 | |
| Dependents | 0.0000 | 0.0000 | 1.0000 | 0.7004 | 1.0000 | 1.0000 | |
| tenure | 0.00 | 9.00 | 29.00 | 32.37 | 55.00 | 72.00 | |
| PhoneService | 0.00000 | 0.00000 | 0.00000 | 0.09683 | 0.00000 | 1.00000 | |
| MultipleLines | 0.000 | 0.000 | 1.000 | 0.675 | 1.000 | 2.000 | |
| InternetService | | | | | | | character, length 7043 |
| OnlineSecurity | 0.00 | 0.00 | 1.00 | 0.93 | 1.00 | 2.00 | |
| OnlineBackup | 0.0000 | 0.0000 | 1.0000 | 0.8718 | 1.0000 | 2.0000 | |
| DeviceProtection | 0.0000 | 0.0000 | 1.0000 | 0.8728 | 1.0000 | 2.0000 | |
| TechSupport | 0.0000 | 0.0000 | 1.0000 | 0.9265 | 1.0000 | 2.0000 | |
| StreamingTV | 0.0000 | 0.0000 | 1.0000 | 0.8323 | 1.0000 | 2.0000 | |
| StreamingMovies | 0.0000 | 0.0000 | 1.0000 | 0.8288 | 1.0000 | 2.0000 | |
| Contract | | | | | | | character, length 7043 |
| PaperlessBilling | 0.0000 | 0.0000 | 0.0000 | 0.4078 | 1.0000 | 1.0000 | |
| PaymentMethod | | | | | | | character, length 7043 |
| MonthlyCharges | 18.25 | 35.50 | 70.35 | 64.76 | 89.85 | 118.75 | |
| TotalCharges | 18.8 | 401.4 | 1397.5 | 2283.3 | 3794.7 | 8684.8 | 11 NA's |
| Churn | 0.0000 | 0.0000 | 1.0000 | 0.7346 | 1.0000 | 1.0000 | |
From this summary we can see that the variables take a wide range of values and types. customerID is a unique identifier for each customer and serves no purpose in our analysis, so we remove it. Our two dependent variables are Churn and tenure. Churn is binary, with 0 meaning yes, the customer churned, and 1 meaning no, the customer did not churn. Tenure is continuous; the maximum is 72 months and the median is 29 months of being subscribed to Telco’s service. SeniorCitizen, PhoneService, Partner, and Dependents were recoded from Yes/No to 0/1 for easier analysis. Similarly, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies were recoded from Yes/No/No Internet to 0/1/2. MonthlyCharges has a maximum of 118.75 and a median of 70.35, while TotalCharges has a maximum of 8684.8 and a median of 1397.5, with 11 missing values.
# A tibble: 2 × 2
Churn n
<chr> <int>
1 0 1869
2 1 5174
# A tibble: 3 × 2
InternetService n
<chr> <int>
1 1 1526
2 DSL 2421
3 Fiber optic 3096
We can see that about 73.5% of customers have not churned and 26.5% have churned. Looking at the potential predictors related to customer churn, we see the strongest relationships with tenure, MonthlyCharges, and TotalCharges. The rest of the variable comparisons are further down.
Data Viz #2
=======================================================================
We see the largest concentrations of values at the start and end of the histogram, at roughly 0-20 months and 60-72 months. Looking at the potential predictors related to tenure, the strongest relationship is with TotalCharges. The only two other continuous variables are MonthlyCharges and TotalCharges; TotalCharges has a relatively high correlation with tenure (.826), while the correlation for MonthlyCharges is smaller (.248). The data also appears to be right skewed because of the concentration on the left of the histogram. The large number of values at 72 months may be due to truncation of the tenure variable or perhaps to unaccounted-for noise. The large number near 0 is likely due to customers who have been subscribed for only a short time.
The Churn variable is binary and thus cannot be made into a histogram. Based on the Churn bar chart, far more customers have not churned than have churned.
Churn Analysis {data-orientation=rows}
=======================================================================
For this analysis we will use a Linear Regression Model.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| ContractOne year | 0.106 | 0.014 | 7.549 | 0.000 |
| TotalCharges | 0.000 | 0.000 | 6.852 | 0.000 |
| InternetServiceFiber optic | -0.354 | 0.058 | -6.111 | 0.000 |
| PaymentMethodElectronic check | -0.068 | 0.013 | -5.086 | 0.000 |
| PaperlessBilling | 0.045 | 0.010 | 4.495 | 0.000 |
| ContractTwo year | 0.070 | 0.017 | 4.110 | 0.000 |
| tenure | 0.002 | 0.001 | 3.917 | 0.000 |
| SeniorCitizen | -0.044 | 0.013 | -3.419 | 0.001 |
| MultipleLines | 0.059 | 0.024 | 2.403 | 0.016 |
| InternetServiceDSL | -0.143 | 0.076 | -1.895 | 0.058 |
| Dependents | -0.020 | 0.011 | -1.766 | 0.078 |
| TechSupport | -0.044 | 0.025 | -1.754 | 0.079 |
| OnlineSecurity | -0.043 | 0.025 | -1.710 | 0.087 |
| StreamingMovies | 0.066 | 0.045 | 1.460 | 0.144 |
| StreamingTV | 0.064 | 0.045 | 1.416 | 0.157 |
| (Intercept) | 0.616 | 0.461 | 1.337 | 0.181 |
| PhoneService | -0.064 | 0.069 | -0.924 | 0.355 |
| PaymentMethodMailed check | 0.007 | 0.015 | 0.465 | 0.642 |
| OnlineBackup | -0.011 | 0.024 | -0.462 | 0.644 |
| PaymentMethodCredit card (automatic) | 0.006 | 0.014 | 0.448 | 0.654 |
| genderMale | 0.003 | 0.009 | 0.375 | 0.707 |
| MonthlyCharges | 0.001 | 0.004 | 0.303 | 0.762 |
| DeviceProtection | 0.005 | 0.025 | 0.185 | 0.853 |
| Partner | -0.001 | 0.011 | -0.079 | 0.937 |
After examining this model, we determine that there are some predictors that are not important in predicting customer churn, so a pruned version of the model is created by removing predictors that are not significant.
For this analysis we will use a pruned Linear Regression Model. The variables we removed are DeviceProtection (whether the customer has device protection), OnlineSecurity (whether the customer has online security), StreamingMovies (whether they have movie streaming support), StreamingTV (whether they have streaming TV support), PhoneService (whether the customer has phone service), Partner (whether the customer has a partner), and gender (Male or Female).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.006 | 0.051 | 19.555 | 0.000 |
| InternetServiceFiber optic | -0.310 | 0.039 | -8.042 | 0.000 |
| ContractOne year | 0.110 | 0.014 | 7.860 | 0.000 |
| TotalCharges | 0.000 | 0.000 | 6.890 | 0.000 |
| InternetServiceDSL | -0.190 | 0.029 | -6.438 | 0.000 |
| PaymentMethodElectronic check | -0.077 | 0.013 | -5.815 | 0.000 |
| PaperlessBilling | 0.054 | 0.010 | 5.362 | 0.000 |
| TechSupport | -0.066 | 0.013 | -5.173 | 0.000 |
| ContractTwo year | 0.074 | 0.017 | 4.355 | 0.000 |
| SeniorCitizen | -0.052 | 0.013 | -3.990 | 0.000 |
| MonthlyCharges | -0.002 | 0.001 | -3.825 | 0.000 |
| tenure | 0.002 | 0.000 | 3.556 | 0.000 |
| OnlineBackup | -0.032 | 0.012 | -2.706 | 0.007 |
| Dependents | -0.023 | 0.010 | -2.234 | 0.026 |
| PaymentMethodMailed check | 0.012 | 0.015 | 0.794 | 0.427 |
| PaymentMethodCredit card (automatic) | 0.007 | 0.014 | 0.529 | 0.597 |
| MultipleLines | 0.002 | 0.010 | 0.190 | 0.850 |
After examining this model, looking at the residual plots we can see that the data is not perfect and there are some problems. There are some high values at the right of the Q-Q plot that may be due to the truncated nature of the data. The deviations from the line represent departures from normality. The curved nature of the Q-Q plot may suggest that there is less variance than expected.
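For the Residuals vs Fitted plot, we can see two very distinct lines. This pattern is expected when a linear model is fitted to a binary response: each observed value is either 0 or 1, so each residual equals either -fitted or 1 - fitted. Ideally, the plot would show the residuals randomly scattered to indicate a consistent and unbiased fit, but we do not see that here, which suggests that linear regression is not well suited to this response.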
Reducing the predictors that did not help with prediction of customer churn actually had a negative impact on our fit statistics (R-square and RMSE (root mean squared error)) as they both slightly decreased by 1%.
From the following table, we can see the effect on Customer Churn by the predictor variables.
| Variable | Direction |
|---|---|
| ContractOne year | Decrease |
| TotalCharges | Increase |
| MonthlyCharges | Increase |
| MultipleLines | Increase |
| TechSupport | Decrease |
| PaymentMethodElectronic check | Increase |
| InternetServiceDSL | Decrease |
| PaperlessBilling | Increase |
| ContractTwo year | Decrease |
| tenure | Decrease |
| SeniorCitizen | Increase |
| OnlineBackup | Decrease |
| InternetServiceFiber optic | Decrease |
| Dependents | Decrease |
| PaymentMethodMailed check | Decrease |
| PaymentMethodCredit card (automatic) | Decrease |
In conclusion, we can see that our predictors do help to predict a customer's tenure, with an R-squared value of .87. The most significant predictors for tenure are TotalCharges, MonthlyCharges, Contract, and PaymentMethod.
From this analysis, we can see that as these variables increase they:

| Decrease_Tenure | Increase_Tenure |
|---|---|
| If the customer has Multiple Lines | Customers Total Charges |
| Whether the customer has churned (Churn) | Customer Monthly Charges |
| If they are a Senior Citizen (SeniorCitizen) | Length of Contract |
| N/A | Customers Payment Method |
| N/A | If the customer has a partner (Partner) |
In Conclusion, we can see that our predictors do help to predict whether a customer will churn, with Tenure, InternetService, and Contract being the most significant variables.
Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:

| Decrease_Prob_to_Churn | Increase_Prob_to_Churn |
|---|---|
| Customer Tenure | Customers Total Charges |
| Whether the customer has online security (OnlineSecurity) | Customers Payment Method |
| Whether the customer is a senior citizen (SeniorCitizen) | The length of customer contract (Contract) |
| Whether the customer has internet service (InternetService) | Whether the customer has paperless billing (PaperlessBilling) |
| Whether the customer has multiple phone lines (MultipleLines) | N/A |
---
title: "Belden INFO 3200 Project"
output:
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```
```{r load_data}
telcodf <- read_csv("Belden-Telco-Customer-Churnv4.csv")
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=250}
-----------------------------------------------------------------------
### Overview
For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results
Row {data-height=650}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
The goal of the analysis is to determine which variables are best at predicting whether a customer will “churn”. Churn means the customer stopped using the company's product or service. The dataset is for a telecommunications company called Telco so in the context of the data churn means that the customer terminated their subscription. With this analysis, Telco will have a better idea of what factors contribute to their clients leaving and therefore develop a plan to retain them.
#### The Data
This dataset has 7,043 rows and 21 variables. For this analysis, we will ignore the `customerID` variable, which serves as a unique identifier for each customer and does not provide any meaningful information for predicting customer churn, leaving 20 variables.
#### Data Sources
* Author(s): Steven Macko
* Title: Telco Customer Churn
* Year: 2018
* Version: 1
* Publisher: IBM/Kaggle
* URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
### The Data
VARIABLES TO PREDICT WITH
* **Gender**: Customer's gender (Male/Female)
* **SeniorCitizen**: Whether the customer is a senior citizen (No(0)/Yes(1))
* **Partner**: Whether the customer has a partner (Yes(0)/No(1))
* **Dependents**: Whether the customer has dependents (Yes(0)/No(1))
* **PhoneService**: Whether the customer has phone service (Yes(0)/No(1))
* **MultipleLines**: Whether the customer has multiple phone lines (Yes(0)/No(1)/No phone service(2))
* **InternetService**: If Customer Has Internet and the Type (DSL/Fiber optic/No(1))
* **OnlineSecurity**: If Customer Has Security on Their Internet (Yes(0)/No(1)/No Internet(2))
* **OnlineBackup**: If Customer Has a Backup for Their Internet (Yes(0)/No(1)/No Internet(2))
* **DeviceProtection**: If Customer has Device Protections Services for Their Internet (Yes(0)/No(1)/No Internet(2))
* **TechSupport**: Whether Customer has Tech Support Enabled on Their Subscription (Yes(0)/No(1)/No Internet(2))
* **StreamingTV**: If the Customer Has StreamingTV Support on Their Subscription (Yes(0)/No(1)/No Internet(2))
* **StreamingMovies**: If the Customer Has Movie Streaming Support on Their Subscription (Yes(0)/No(1)/No Internet(2))
* **Contract**: Type of contract (Month-to-month/One year/Two year)
* **PaperlessBilling**: If the Customer Has Opted for Paperless Billing (Yes(0)/No(1))
* **PaymentMethod**: Payment method (Electronic check/Mailed check/Bank transfer/Credit card)
* **MonthlyCharges**: Monthly charges for the customer (continuous variable)
* **TotalCharges**: Total charges accumulated by the customer (continuous variable)
VARIABLES WE WANT TO PREDICT
* *Churn*: Whether the customer churned (Yes(0)/No(1); binary response variable)
* *Tenure*: # of Months with Subscription (continuous, response variable)
Data
=======================================================================
Column {data-width=650}
-----------------------------------------------------------------------
### Organize the Data
Organizing data can also include summarizing data values in simple one-way and two-way tables.
```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.
#Clean column names by replacing invalid characters (such as spaces) with periods
colnames(telcodf) <- make.names(colnames(telcodf))
#View data
summary(telcodf)
#remove customerID due to it being an identifier
telcodf <- select(telcodf, -customerID)
```
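As a small illustration of what `make.names()` does to column headers, the sketch below uses made-up header strings. Only `customerID` matches the real file; the other two are assumptions chosen to show the behaviour.

```r
# Hypothetical column headers; only the first one appears in the Telco data.
raw_names <- c("customerID", "Monthly Charges", "2nd Line")

# make.names() turns each string into a syntactically valid R name:
# invalid characters become ".", and a name starting with a digit gets an "X" prefix.
clean_names <- make.names(raw_names)
print(clean_names)
```

Names that are already valid, such as `customerID`, pass through unchanged.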
From this summary we can see that the variables take a wide range of values and types. customerID is a unique identifier for each customer and serves no purpose in our analysis, so we remove it. Our two dependent variables are Churn and tenure. Churn is binary, with 0 meaning yes, the customer churned, and 1 meaning no, the customer did not churn. Tenure is continuous; the maximum is 72 months and the median is 29 months of being subscribed to Telco's service. SeniorCitizen, PhoneService, Partner, and Dependents were recoded from Yes/No to 0/1 for easier analysis. Similarly, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies were recoded from Yes/No/No Internet to 0/1/2. MonthlyCharges has a maximum of 118.75 and a median of 70.35, while TotalCharges has a maximum of 8684.8 and a median of 1397.5, with 11 missing values.
Column {data-width=350}
-----------------------------------------------------------------------
### Transform Variables
My RStudio was having problems running the code with the variables as factors. Therefore, I had to recode my excel file to replace the characters of these variables with numbers and set them as numeric.
```{r, cache=TRUE}
telcodf <- mutate(telcodf,
SeniorCitizen = as.numeric(SeniorCitizen),
Partner = as.numeric(Partner),
Dependents = as.numeric(Dependents),
MultipleLines = as.numeric(MultipleLines),
OnlineSecurity = as.numeric(OnlineSecurity),
OnlineBackup = as.numeric(OnlineBackup),
DeviceProtection = as.numeric(DeviceProtection),
TechSupport = as.numeric(TechSupport),
StreamingTV = as.numeric(StreamingTV),
StreamingMovies = as.numeric(StreamingMovies),
PaperlessBilling = as.numeric(PaperlessBilling),
Churn = as.numeric(Churn))
```
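The same recoding could in principle be done in R rather than in Excel. The following is a minimal sketch using named lookup vectors on made-up values. The vector names and the toy data are assumptions for illustration, and the codes follow the Yes(0)/No(1)/No Internet(2) scheme described in this report.

```r
# Hypothetical toy values standing in for a Yes/No column and a three-level column.
partner_raw <- c("Yes", "No", "No", "Yes")
security_raw <- c("Yes", "No", "No internet service", "No")

# Named lookup vectors following the coding stated in this report.
yes_no_codes <- c("Yes" = 0, "No" = 1)
three_level_codes <- c("Yes" = 0, "No" = 1, "No internet service" = 2)

# Index each lookup by the raw strings, then drop the names.
partner_num <- unname(yes_no_codes[partner_raw])
security_num <- unname(three_level_codes[security_raw])

print(partner_num)
print(security_num)
```

Keeping the mapping in one visible place makes it easier to check which level received which code.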
#### Customer Churn(Yes(0)/No(1)) & InternetService(DSL,Fiber Optic, No Internet(1))
```{r cache=TRUE}
as_tibble(select(telcodf,Churn) %>%
table())
as_tibble(select(telcodf,InternetService) %>%
table())
```
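The tables above are one-way tables. A two-way table crosses two variables; the sketch below shows the idea on a small made-up data frame (the values are assumptions, not rows from the Telco file).

```r
# Hypothetical toy data: Churn coded Yes(0)/No(1), plus a contract type.
toy <- data.frame(
  Churn    = c(0, 1, 1, 0, 1, 1),
  Contract = c("Month-to-month", "One year", "Two year",
               "Month-to-month", "Month-to-month", "Two year")
)

# Two-way table of counts: rows are contract types, columns are churn codes.
two_way <- table(toy$Contract, toy$Churn)
print(two_way)

# Row proportions: share of each contract type that churned (0) or stayed (1).
row_props <- prop.table(two_way, margin = 1)
print(round(row_props, 2))
```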
#### Customer Churn (Yes or No)

Data Viz #1
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### Churn Yes(0)/No(1)
```{r, cache=TRUE}
# Bar chart of churn counts (0 = churned, 1 = did not churn)
telcodf %>%
  count(Churn) %>%
  ggplot(aes(y = n, x = factor(Churn))) +
  geom_bar(stat = "identity")
```
We can see that about 73.5% of customers have not churned and 26.5% have churned. Looking at the potential predictors related to customer churn, we see the strongest relationships with tenure, MonthlyCharges, and TotalCharges. The rest of the variable comparisons are further down.
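The churn percentages can be checked directly from the counts in the Churn table (1,869 churned, 5,174 did not). A minimal sketch; the vector and element names are chosen here for illustration.

```r
# Counts reported in the Churn table of this report: 0 = churned, 1 = did not churn.
churn_counts <- c(churned = 1869, stayed = 5174)

# Convert counts to percentages of all customers, rounded to one decimal place.
churn_pct <- round(100 * churn_counts / sum(churn_counts), 1)
print(churn_pct)
```

The two counts sum to 7,043, and the percentages round to 26.5% churned and 73.5% not churned.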
Column {data-width=500}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(telcodf,Churn,tenure,MonthlyCharges,TotalCharges,Contract))
```
Data Viz #2
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### Tenure & Churn
```{r, cache=TRUE}
ggplot(telcodf, aes(tenure)) + geom_histogram(bins=60)
ggplot(telcodf, aes(x = Churn)) + geom_bar()
```
We see the largest concentrations of values at the start and end of the histogram, at roughly 0-20 months and 60-72 months. Looking at the potential predictors related to tenure, the strongest relationship is with TotalCharges. The only two other continuous variables are MonthlyCharges and TotalCharges; TotalCharges has a relatively high correlation with tenure (.826), while the correlation for MonthlyCharges is smaller (.248). The data also appears to be right skewed because of the concentration on the left of the histogram. The large number of values at 72 months may be due to truncation of the tenure variable or perhaps to unaccounted-for noise. The large number near 0 is likely due to customers who have been subscribed for only a short time.
The Churn variable is binary and thus cannot be made into a histogram. Based on the Churn bar chart, far more customers have not churned than have churned.
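Correlations between tenure and the two charge variables can be computed with `cor()`. The sketch below uses simulated data, not the Telco file. The value ranges are assumptions, and the total charge is constructed as months multiplied by monthly charge, which is only an approximation of how such a column might arise.

```r
set.seed(1)  # make the simulation reproducible

# Simulated stand-ins for the Telco columns (assumed ranges, not real data).
n <- 1000
tenure <- sample(1:72, n, replace = TRUE)   # months subscribed
monthly <- runif(n, min = 18, max = 119)    # monthly charge, independent of tenure
total <- tenure * monthly                   # assumed: total = months x monthly charge

# Pearson correlation of tenure with each charge variable.
cor_monthly <- cor(tenure, monthly)
cor_total <- cor(tenure, total)
print(round(c(monthly = cor_monthly, total = cor_total), 3))
```

In this simulation a total built from tenure tracks tenure closely, while an independently drawn monthly charge does not. In real data the monthly charge need not be independent of tenure.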
Column {data-width=500}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(telcodf,Churn,tenure,SeniorCitizen,Partner,Dependents,PhoneService))
ggpairs(select(telcodf,Churn,tenure,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection))
ggpairs(select(telcodf,Churn,tenure,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,PaymentMethod))
```
Churn Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict Customer Churn (Yes(0)/No(1))
For this analysis we will use a Linear Regression Model.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Churn_lm <- lm(Churn ~ . ,data = telcodf)
summary(Churn_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Churn_lm)
```
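Fitting `lm()` to a 0/1 response is sometimes called a linear probability model. The sketch below, on simulated data (an assumption, not the Telco data), fits such a model next to a logistic regression so the fitted ranges can be compared.

```r
set.seed(42)

# Simulated data (an assumption for illustration): one predictor, binary outcome.
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x))

# Linear regression on a 0/1 response (a linear probability model).
lpm <- lm(y ~ x)

# Logistic regression on the same data for comparison.
logit <- glm(y ~ x, family = binomial)

# Fitted values: lm() is not bounded, glm() stays strictly between 0 and 1.
print(range(fitted(lpm)))
print(range(fitted(logit)))
```

Both models agree on the sign of the effect here, but only the logistic fit is guaranteed to return values that can be read as probabilities.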
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(Churn_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(Churn_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(Churn_lm)$coef, digits = 3) #pretty table output
summary(Churn_lm)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Churn_lm))[,4])
out <- coef(summary(Churn_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(Churn_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in predicting customer churn, so a pruned version of the model is created by removing predictors that are not significant.
Row
-----------------------------------------------------------------------
### Predict Customer Churn Final Version
For this analysis we will use a pruned Linear Regression Model. The variables we removed are DeviceProtection (whether the customer has device protection), OnlineSecurity (whether the customer has online security), StreamingMovies (whether they have movie streaming support), StreamingTV (whether they have streaming TV support), PhoneService (whether the customer has phone service), Partner (whether the customer has a partner), and gender (Male or Female).
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Churn_lm <- lm(Churn ~ . -DeviceProtection -OnlineSecurity -StreamingMovies -StreamingTV -PhoneService -Partner -gender,data = telcodf)
summary(Churn_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Churn_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(Churn_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(Churn_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(Churn_lm)$coef, digits = 3) #pretty table output
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Churn_lm))[,4])
out <- coef(summary(Churn_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(Churn_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, looking at the residual plots we can see that the data is not perfect and there are some problems. There are some high values at the right of the Q-Q plot that may be due to the truncated nature of the data. The deviations from the line represent departures from normality. The curved nature of the Q-Q plot may suggest that there is less variance than expected.
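For the Residuals vs Fitted plot, we can see two very distinct lines. This pattern is expected when a linear model is fitted to a binary response: each observed value is either 0 or 1, so each residual equals either -fitted or 1 - fitted. Ideally, the plot would show the residuals randomly scattered to indicate a consistent and unbiased fit, but we do not see that here, which suggests that linear regression is not well suited to this response.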
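Two lines in a Residuals vs Fitted plot are what a linear model fitted to a 0/1 response generally produces. The sketch below checks this on simulated data (an assumption, not the Telco data): every residual falls on one of two lines.

```r
set.seed(7)

# Simulated binary response (an assumption for illustration).
x <- runif(200)
y <- rbinom(200, size = 1, prob = 0.2 + 0.6 * x)

fit <- lm(y ~ x)

# residual = observed - fitted, and observed is only ever 0 or 1,
# so every point lies on one of two lines: 0 - fitted or 1 - fitted.
on_lower_line <- abs(residuals(fit) - (0 - fitted(fit))) < 1e-8
on_upper_line <- abs(residuals(fit) - (1 - fitted(fit))) < 1e-8
print(table(lower = on_lower_line, upper = on_upper_line))
```

Observations with a response of 0 sit on the lower line and those with a response of 1 on the upper line.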
Reducing the predictors that did not help with prediction of customer churn actually had a negative impact on our fit statistics (R-square and RMSE (root mean squared error)) as they both slightly decreased by 1%.
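One way to compare a full and a pruned model side by side is to keep both objects and tabulate their fit statistics. The sketch below uses the built-in `mtcars` data as a stand-in for the Telco data. The helper name `fit_stats` and the choice of predictors are assumptions for illustration.

```r
# mtcars ships with R; it stands in for the Telco data here.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Drop two predictors with update(), keeping the full model for comparison.
pruned <- update(full, . ~ . - qsec - drat)

# Helper (hypothetical name) collecting fit statistics for one model.
fit_stats <- function(model) {
  c(adj_r2 = summary(model)$adj.r.squared,
    sigma  = summary(model)$sigma,            # sqrt(RSS / residual df)
    rmse   = sqrt(mean(residuals(model)^2)))  # sqrt(RSS / n)
}

comparison <- rbind(full = fit_stats(full), pruned = fit_stats(pruned))
print(round(comparison, 3))
```

Note that `summary()$sigma` is the residual standard error, which divides by the residual degrees of freedom and is therefore slightly larger than a plain RMSE. For a given response, adjusted R-squared equals 1 minus sigma squared over the sample variance of the response, so the two move in opposite directions.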
From the following table, we can see the effect on Customer Churn by the predictor variables.
```{r, cache=TRUE}
#create table summary of predictor changes
predchang = data.frame(
Variable = c('ContractOne year', 'TotalCharges', 'MonthlyCharges', 'MultipleLines', 'TechSupport','PaymentMethodElectronic check', 'InternetServiceDSL', 'PaperlessBilling','ContractTwo year', 'tenure', 'SeniorCitizen', 'OnlineBackup','InternetServiceFiber optic', 'Dependents', 'PaymentMethodMailed check','PaymentMethodCredit card (automatic)'),
Direction = c('Decrease', 'Increase', 'Increase', 'Increase', 'Decrease','Increase', 'Decrease', 'Increase', 'Decrease','Decrease','Increase', 'Decrease', 'Decrease', 'Decrease', 'Decrease','Decrease')
)
knitr::kable(predchang) #pretty table output
```
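A direction table like the one above can also be derived from the signs of the fitted coefficients rather than typed by hand. This is a minimal sketch on the built-in `mtcars` data (a stand-in, not the Telco model); the object names are assumptions.

```r
# mtcars ships with R; it stands in for the Telco data here.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Drop the intercept, then label each coefficient by its sign.
est <- coef(fit)[-1]
direction <- data.frame(
  Variable  = names(est),
  Direction = unname(ifelse(est > 0, "Increase", "Decrease"))
)
print(direction)
```

The labels describe the effect on the coded response. Under this report's Yes(0)/No(1) coding of Churn, a positive coefficient moves the fitted value toward 1, that is, toward not churning.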
Tenure Analysis {data-orientation=rows}
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Predict Customer Tenure

Conclusion 1
=======================================================================
### Summary
In conclusion, we can see that our predictors do help to predict a customer's tenure, with an R-squared value of .87. The most significant predictors for tenure are TotalCharges, MonthlyCharges, Contract, and PaymentMethod.
From this analysis, we can see that as these variables increase they:
```{r}
#final table summary of predictor changes
predtenurefnl = tibble(Decrease_Tenure =
c("If the customer has Multiple Lines",
"Whether the customer has churned (Churn)",
"If they are a Senior Citizen (SeniorCitizen)",
"N/A", "N/A"
),
Increase_Tenure = c("Customers Total Charges",
"Customer Monthly Charges",
"Length of Contract",
"Customers Payment Method",
"If the customer has a partner (Partner)"
))
knitr::kable(predtenurefnl) #pretty table output
```
Additional Churn Analysis 1 {data-orientation=rows}
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Predict Churn - Logistic Regression

Conclusion 2
=======================================================================
### Summary
In Conclusion, we can see that our predictors do help to predict whether a customer will churn, with Tenure, InternetService, and Contract being the most significant variables.
Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:
```{r}
#final table summary of predictor changes
predchurnfnl = tibble(Decrease_Prob_to_Churn =
c("Customer Tenure",
"Whether the customer has online security (OnlineSecurity)",
"Whether the customer is a senior citizen (SeniorCitizen)",
"Whether the customer has internet service (InternetService)",
"Whether the customer has multiple phone lines (MultipleLines)"
),
Increase_Prob_to_Churn = c("Customers Total Charges",
"Customers Payment Method",
"The length of customer contract (Contract)",
"Whether the customer has paperless billing (PaperlessBilling)",
"N/A"
))
knitr::kable(predchurnfnl) #pretty table output
```
Additional Churn Analysis 2 {data-orientation=rows}
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Predict Churn - Decision Tree
